R version 4.1.1 “Kick Things”
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(GGally)
library(gridExtra)
library(plotly)
##    country               year          sex                age
##  Length:27820       Min.   :1985   Length:27820       Length:27820
##  Class :character   1st Qu.:1995   Class :character   Class :character
##  Mode  :character   Median :2002   Mode  :character   Mode  :character
##                     Mean   :2001
##                     3rd Qu.:2008
##                     Max.   :2016
##
##   suicides_no      population       suicides/100k pop country-year
##  Min.   :    0   Min.   :     278   Min.   :  0.00    Length:27820
##  1st Qu.:    3   1st Qu.:   97498   1st Qu.:  0.92    Class :character
##  Median :   25   Median :  430150   Median :  5.99    Mode  :character
##  Mean   :  243   Mean   : 1844794   Mean   : 12.82
##  3rd Qu.:  131   3rd Qu.: 1486143   3rd Qu.: 16.62
##  Max.   :22338   Max.   :43805214   Max.   :224.97
##
##   HDI for year   gdp_for_year ($)   gdp_per_capita ($)  generation
##  Min.   :0       Min.   :4.69e+07   Min.   :   251      Length:27820
##  1st Qu.:1       1st Qu.:8.99e+09   1st Qu.:  3447      Class :character
##  Median :1       Median :4.81e+10   Median :  9372      Mode  :character
##  Mean   :1       Mean   :4.46e+11   Mean   : 16866
##  3rd Qu.:1       3rd Qu.:2.60e+11   3rd Qu.: 24874
##  Max.   :1       Max.   :1.81e+13   Max.   :126352
##  NA's   :19456
Columns kept: country, sex, age, suicides, population, suicides per 100k, gdp per capita, year
age_levels <- c("5-14", "15-24", "25-34", "35-54", "55-74", "75+")
suic_fmt <- suicides %>%
  mutate(country = factor(country),
         year = year,
         sex = factor(sex),
         age = factor(str_replace(age, " years", ""), levels = age_levels),
         suicides_p100k = `suicides/100k pop`,
         gdp_per_capita = `gdp_per_capita ($)`) %>%
  select(country, year, sex, age, suicides_no, population, suicides_p100k, gdp_per_capita)
print(summary(suic_fmt))
##         country           year          sex           age         suicides_no
##  Austria    :  382   Min.   :1985   female:13910   5-14 :4610   Min.   :    0
##  Iceland    :  382   1st Qu.:1995   male  :13910   15-24:4642   1st Qu.:    3
##  Mauritius  :  382   Median :2002                  25-34:4642   Median :   25
##  Netherlands:  382   Mean   :2001                  35-54:4642   Mean   :  243
##  Argentina  :  372   3rd Qu.:2008                  55-74:4642   3rd Qu.:  131
##  Belgium    :  372   Max.   :2016                  75+  :4642   Max.   :22338
##  (Other)    :25548
##    population       suicides_p100k   gdp_per_capita
##  Min.   :     278   Min.   :  0.00   Min.   :   251
##  1st Qu.:   97498   1st Qu.:  0.92   1st Qu.:  3447
##  Median :  430150   Median :  5.99   Median :  9372
##  Mean   : 1844794   Mean   : 12.82   Mean   : 16866
##  3rd Qu.: 1486143   3rd Qu.: 16.62   3rd Qu.: 24874
##  Max.   :43805214   Max.   :224.97   Max.   :126352
##
Nothing looks out of the ordinary here. The whole file appears to have been read in, so there are no miscellaneous header or footer lines to skip.
## # A tibble: 6 x 8
## country year sex age suicides_no population suicides_p100k gdp_per_capita
## <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 1987 male 15-24 21 312900 6.71 796
## 2 Albania 1987 male 35-54 16 308000 5.19 796
## 3 Albania 1987 fema… 15-24 14 289700 4.83 796
## 4 Albania 1987 male 75+ 1 21800 4.59 796
## 5 Albania 1987 male 25-34 9 274300 3.28 796
## 6 Albania 1987 fema… 75+ 1 35600 2.81 796
## # A tibble: 6 x 8
## country year sex age suicides_no population suicides_p100k gdp_per_capita
## <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Uzbeki… 2014 fema… 25-34 162 2735238 5.92 2309
## 2 Uzbeki… 2014 fema… 35-54 107 3620833 2.96 2309
## 3 Uzbeki… 2014 fema… 75+ 9 348465 2.58 2309
## 4 Uzbeki… 2014 male 5-14 60 2762158 2.17 2309
## 5 Uzbeki… 2014 fema… 5-14 44 2631600 1.67 2309
## 6 Uzbeki… 2014 fema… 55-74 21 1438935 1.46 2309
Sorting by suicide counts in descending order shows that Russian men aged 35-54 had the highest raw suicide counts, from the early 1990s through the early 2000s, but these counts are not yet scaled by population. Does the same pattern appear if we look at suicides per 100k persons instead?
## # A tibble: 1,467 x 8
## country year sex age suicides_no population suicides_p100k
## <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 Russian Federation 1994 male 35-54 22338 19044200 117.
## 2 Russian Federation 1995 male 35-54 21706 19249600 113.
## 3 Russian Federation 2001 male 35-54 21262 21476420 99
## 4 Russian Federation 2000 male 35-54 21063 21378098 98.5
## 5 Russian Federation 1999 male 35-54 20705 21016400 98.5
## 6 Russian Federation 1996 male 35-54 20562 19507100 105.
## 7 Russian Federation 1993 male 35-54 20256 18908000 107.
## 8 Russian Federation 2002 male 35-54 20119 21320535 94.4
## 9 Russian Federation 1997 male 35-54 18973 19913400 95.3
## 10 Russian Federation 2003 male 35-54 18681 21007346 88.9
## # … with 1,457 more rows, and 1 more variable: gdp_per_capita <dbl>
Russian men are no longer at the top of the list, so their high raw counts may largely reflect a large population. Both lists contain only men in the older age groups (35-54 and 75+), so sex and age may both play a role here.
## # A tibble: 27,820 x 8
## country year sex age suicides_no population suicides_p100k
## <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 Aruba 1995 male 75+ 2 889 225.
## 2 Seychelles 2006 male 75+ 2 976 205.
## 3 Suriname 2012 male 75+ 10 5346 187.
## 4 Republic of Korea 2011 male 75+ 1276 688365 185.
## 5 Republic of Korea 2010 male 75+ 1152 631853 182.
## 6 Hungary 1992 male 75+ 317 178482 178.
## 7 Hungary 1993 male 75+ 300 168944 178.
## 8 Hungary 1991 male 75+ 333 188235 177.
## 9 Republic of Korea 2005 male 75+ 780 442349 176.
## 10 Hungary 1994 male 75+ 292 165660 176.
## # … with 27,810 more rows, and 1 more variable: gdp_per_capita <dbl>
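The code behind the two rankings above is not shown in this output. As a minimal sketch of the comparison, assuming a small hypothetical data frame `toy` built from four male rows printed in the tables above (not the full dataset), the two sorts could look like this:

```r
library(dplyr)

# Toy data: four male rows copied from the tables above (illustration only).
toy <- tibble(
  country     = c("Russian Federation", "Aruba", "Republic of Korea", "Hungary"),
  year        = c(1994, 1995, 2011, 1992),
  suicides_no = c(22338, 2, 1276, 317),
  population  = c(19044200, 889, 688365, 178482)
) %>%
  # Rate per 100k persons, recomputed from the counts
  mutate(suicides_p100k = suicides_no / population * 100000)

# Rank by raw count: groups with large populations dominate
by_count <- toy %>% arrange(desc(suicides_no))

# Rank by rate: small groups with very few deaths can rise to the top
by_rate <- toy %>% arrange(desc(suicides_p100k))

print(by_count)
print(by_rate)
```

In this toy data, Aruba's 2 deaths in a population of 889 outrank Russia's 22,338 deaths once the counts are scaled, which is a reason to read the top of the rate ranking with some caution for very small groups.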
# We can also look at the quantile breakdown for suicides per 100k persons:
quantile(suic_fmt$suicides_p100k, probs = seq(0, 1, 1/10))
##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
##   0.00   0.00   0.41   1.60   3.54   5.99   9.09  13.56  20.53  33.29 224.97
# A boxplot shows that most rates are low, with a tail of high outliers
boxplot(suic_fmt$suicides_p100k)
Does sex or age contribute to historic suicide rates? Yes: suicides appear more common among men than women, and among older people than younger ones, though this plot pools about 30 years of data for over 100 countries.
ggplot(suic_fmt) +
  geom_jitter(aes(x = sex, y = suicides_p100k, color = sex),
              alpha = 0.2) +
  facet_grid( ~ age)
I first tried a regular histogram for the variable suicides_no, but the right skew is so pronounced that applying a log to the x axis made sense. I suspect population, suicides per 100k, and GDP per capita would be similarly hard to read without a transformation.
I created a function that draws a log histogram with a density overlay to help make these 4 plots. The get() function lets ggplot look up the correct column from a column name passed in as a “string”. If I were using Python, I would have created a dictionary and looped over it to get the columns and axis labels, but R makes this process harder.
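For comparison, a named character vector combined with Map() gives something close to the dictionary-and-loop pattern mentioned above. This is only a sketch under assumptions: `toy`, `labels`, and `make_hist_alt` are hypothetical names, the toy values are illustrative, and it uses ggplot2's `.data[[col]]` pronoun rather than get().

```r
library(ggplot2)

# Toy stand-in for suic_fmt (illustrative values only).
toy <- data.frame(
  suicides_no = c(1, 3, 25, 131, 243),
  population  = c(278, 97498, 430150, 1486143, 1844794)
)

# Named character vector: names are column names, values are axis labels.
# This plays the role of the Python dictionary.
labels <- c(suicides_no = "log(Suicides)",
            population  = "log(Population)")

# .data[[col]] looks up a column by its string name inside aes().
make_hist_alt <- function(col, xlab) {
  ggplot(toy) +
    geom_histogram(aes(x = log(.data[[col]]), y = after_stat(density)), bins = 5) +
    labs(x = xlab, y = "Density")
}

# Map() walks the column names and labels in parallel and returns a list of plots.
plots <- Map(make_hist_alt, names(labels), labels)
```

The resulting list can be passed to gridExtra in one call, for example `grid.arrange(grobs = plots, ncol = 2)`.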
# Histogram with a density overlay on the log scale.
# Note: log(0) is -Inf, so zero counts and zero rates are dropped with a warning.
make_hist <- function(xlab, col) {
  ggplot(suic_fmt) +
    # after_stat(density) replaces the older ..density.. notation
    geom_histogram(aes(x = log(get(col)), y = after_stat(density))) +
    geom_density(aes(x = log(get(col)))) +
    labs(x = xlab, y = "Density")
}
h1 <- make_hist(xlab = "log(Suicides)", col = "suicides_no")
h2 <- make_hist(xlab = "log(Population)", col = "population")
h3 <- make_hist(xlab = "log(Suicides_Per100k)", col = "suicides_p100k")
h4 <- make_hist(xlab = "log(GDP_Per_Capita)", col = "gdp_per_capita")
grid.arrange(h1, h2, h3, h4, ncol = 2)
Lastly, I would like to look at the year variable to see whether there are any time-series trends at play.
# Pool all sex and age groups into one rate per country and year
suic_totals <- suic_fmt %>%
  select(year, country, suicides_no, population) %>%
  group_by(year, country) %>%
  summarize(total_suicides = sum(suicides_no),
            total_pop = sum(population),
            total_suicides_p100k = total_suicides / total_pop * 100000,
            .groups = "drop")
# We are missing data from some years for some countries
# Show table of year, country, suicide rates
# pivot_longer(suic_totals, year,
# Interactive line chart: one line per country
ggplotly(ggplot(suic_totals) +
           geom_line(aes(x = year, y = total_suicides_p100k, color = country, group = country),
                     alpha = 0.3, show.legend = FALSE))
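The aggregation above recomputes the rate from summed counts and summed population instead of averaging the per-group rates. A small worked check, assuming a hypothetical data frame `toy` holding two of the Albania 1987 rows printed earlier, illustrates why the two differ:

```r
library(dplyr)

# Two Albania 1987 rows copied from the head() output (male 15-24 and male 75+).
toy <- tibble(
  country     = c("Albania", "Albania"),
  year        = c(1987, 1987),
  suicides_no = c(21, 1),
  population  = c(312900, 21800)
)

toy_totals <- toy %>%
  group_by(year, country) %>%
  summarize(total_suicides = sum(suicides_no),
            total_pop = sum(population),
            # Pooled rate: total deaths over total population
            total_suicides_p100k = total_suicides / total_pop * 100000,
            # Unweighted mean of the per-group rates, for comparison only
            mean_group_rate = mean(suicides_no / population * 100000),
            .groups = "drop")

print(toy_totals)
```

Here the pooled rate is about 6.57 per 100k, while the unweighted mean of the two group rates is about 5.65, because the small 75+ group counts as much as the much larger 15-24 group in a simple average.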